Hidden Markov Model
Essence
Model: States are hidden.
$$ \begin{aligned} \begin{matrix} \text{States}: & \boxed{x_1} & \boxed{x_2} & \cdots & \boxed{x_T} \\ & \downarrow & \downarrow & \cdots & \downarrow \\ \text{Observations}: & y_1 & y_2 & \cdots & y_T \\ \end{matrix} \end{aligned} $$Goal: Determine the state at each moment, given the observations.
$$ \begin{aligned} \underline{x}^{*} &= \arg\max_{\underline{x}} P[\underline{x}|\underline{y}] \\ &= \arg\max_{\underline{x}} P[\underline{y}|\underline{x}] \cdot P[\underline{x}] \end{aligned} $$Definition
$$ \lambda = (A, B, \pi) \text{ or } (N, M, A, B, \pi) $$- $ A $ : State Transition Matrix. $ N $ : Number of states, $\{S_1, \dots, S_N\}$.
- $ a_{ij} = P[Q_{t+1} = S_j | Q_t = S_i]$.
- $ B $ : Observation Symbol Distribution. $ M $ : Number of distinct symbols per state, $\{V_1, \dots, V_M\}$.
- $ b_{j}(k) = P[O_t = V_k|Q_t = S_j] $.
- $ \pi $ : Initial State Distribution.
- $ \pi_i = P[Q_1 = S_i] $
Goal
Given observations ($O_t$ is a symbol):
$$ O = \begin{matrix} O_1 & O_2 & \dots & O_T \end{matrix} $$We want: $$ \arg\max_{\lambda} P[O | \lambda] $$
Problems:
- Problem 1: A Hidden Model
- Problem 2: The Training Criteria (Reason for Forward-Backward Algorithm)
- Problem 3: Model Parameters
Problem 1: A Hidden Model
The Observation Sequence:
$$ \begin{aligned} O &= \begin{matrix} O_1 & O_2 &\cdots &O_T \end{matrix} \end{aligned} $$The State Sequence: (The Hidden Model) $$ \begin{aligned} Q &= \begin{matrix} Q_1 & Q_2 &\cdots &Q_T \end{matrix} \end{aligned} $$
Solution 1
Since:
$$ \begin{aligned} &\begin{cases} P[Q|\lambda] = \pi_{Q_1} \cdot a_{Q_1 Q_2} \cdot a_{Q_2 Q_3} \cdots a_{Q_{T-1} Q_T} \\ P[O|Q, \lambda] = \prod_{t=1}^{T} P[O_t | Q_t, \lambda] = b_{Q_1}(O_1) \cdot b_{Q_2}(O_2) \cdots b_{Q_T}(O_T) \\ \end{cases} \end{aligned} $$We have:
$$ \begin{aligned} P[O|\lambda] &= \sum_{Q} P[O, Q|\lambda] \\ &= \sum_{Q} P[O|Q, \lambda] \cdot P[Q|\lambda] \\ &= \sum_{Q_1 \cdots Q_T} [b_{Q_1}(O_1) \cdot b_{Q_2}(O_2) \cdots b_{Q_T}(O_T) ] \cdot [\pi_{Q_1} \cdot a_{Q_1 Q_2} \cdot a_{Q_2 Q_3} \cdots a_{Q_{T-1} Q_T}] \\ &= \sum_{Q_1 \cdots Q_T} [\pi_{Q_1} \cdot b_{Q_1}(O_1)] \cdot [a_{Q_1 Q_2} \cdot b_{Q_2}(O_2)] \cdots [a_{Q_{T-1} Q_T} \cdot b_{Q_T}(O_T)] \\ P[O|\lambda] &\sim (N^T \cdot 2T) \end{aligned} $$A simple case of $N = 5, T = 100$ would already require about $ 10^{72} $ computations. This solution is impractical.
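The brute-force sum over all $N^T$ state sequences can be sketched directly (a minimal illustration; the model numbers below are invented, not from the text):

```python
from itertools import product

# Hypothetical toy model for illustration: N = 2 states, M = 2 symbols.
pi = [0.6, 0.4]                      # pi_i = P[Q_1 = S_i]
A  = [[0.7, 0.3], [0.4, 0.6]]        # a_ij = P[Q_{t+1} = S_j | Q_t = S_i]
B  = [[0.9, 0.1], [0.2, 0.8]]        # b_j(k) = P[O_t = V_k | Q_t = S_j]
O  = [0, 1, 0]                       # observed symbol indices

# Enumerate every state sequence Q and sum P[O|Q,lambda] * P[Q|lambda].
total = 0.0
for Q in product(range(len(pi)), repeat=len(O)):
    p = pi[Q[0]] * B[Q[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[Q[t-1]][Q[t]] * B[Q[t]][O[t]]
    total += p
print(total)  # P[O | lambda]
```

Even here the loop visits $2^3 = 8$ sequences; at $N = 5, T = 100$ it would visit $5^{100}$, which is why the forward-backward procedure below is needed.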
Solution 2: Forward-Backward Procedure
Forward Phase
Definition: (Forward Variable)
$$ \begin{aligned} \alpha_{t}(i) &= P[O_1 O_2 \cdots O_t, Q_t = S_i | \lambda] \end{aligned} $$Procedure:
Initialization:
$$ \begin{aligned} \alpha_1(i) &= \pi_i \cdot b_{i}(O_1) \\ \alpha_1(i) &\sim [\times: 1] \\ \alpha_1 &\sim [\times: N] \end{aligned} $$
Induction:
$$ \begin{aligned} \alpha_{t+1}(j) &= \left[ \sum_{i=1}^{N} \alpha_{t}(i) \cdot a_{ij} \right] \cdot b_{j}(O_{t+1}) \\ \alpha_{t+1}(j) &\sim [\times: (N+1) ; +: (N-1)] \\ \alpha_{t+1} &\sim [\times:N (N+1) ; +: N (N-1)] \end{aligned} $$
Termination:
$$ \begin{aligned} P[O|\lambda] &= \sum_{i=1}^{N} \alpha_{T}(i) \end{aligned} $$
Complexity:
$$ \begin{aligned} \alpha_{[1, T]} &\sim [\times: (T-1) \cdot N \cdot (N + 1) + N; +: (T - 1) \cdot N \cdot (N - 1)] \end{aligned} $$A simple case of $ N = 5, T = 100$ requires about $3000$ computations, compared to the $10^{72}$ computations previously.
The savings come from the lattice structure induced by the Markov property: each state depends only on the previous state (as in a Bayesian network), so partial sums are shared across all paths.
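The three steps above (initialization, induction, termination) can be sketched as follows; the toy model numbers are hypothetical, chosen only for illustration:

```python
def forward(pi, A, B, O):
    """Forward pass: alpha[t][i] = P[O_1 .. O_{t+1}, Q_{t+1} = S_i | lambda] (0-indexed t)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]              # initialization
    for t in range(1, len(O)):                                    # induction
        prev = alpha[-1]
        alpha.append([B[j][O[t]] * sum(prev[i] * A[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

# Hypothetical toy model: N = 2 states, M = 2 symbols.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
O  = [0, 1, 0]
alpha = forward(pi, A, B, O)
print(sum(alpha[-1]))   # termination: P[O | lambda] = sum_i alpha_T(i)
```

Each induction step touches only the $N$ values from the previous time slice, which is exactly where the $O(N^2 T)$ cost comes from.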
Backward Phase
Definition: (Backward Variable)
$$ \begin{aligned} \beta_t(i) &= P[O_{t+1} O_{t+2} \cdots O_T | Q_t = S_i, \lambda] \end{aligned} $$Procedure:
Initialization: $$ \begin{aligned} \beta_{T}(i) &= 1 \end{aligned} $$
Induction: $$ \begin{aligned} \beta_{t}(i) &= \sum_{j=1}^{N} \beta_{t+1}(j) \cdot b_j(O_{t+1}) \cdot a_{ij} \\ \beta_{t}(i) &\sim [\times:2N; +: (N - 1)] \\ \beta_{t} &\sim [\times: 2N^2; +: N(N-1)] \end{aligned} $$
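The backward recursion mirrors the forward one, filling the table from $t = T$ down to $t = 1$ (a sketch with an invented toy model; the sanity check uses the identity $P[O|\lambda] = \sum_i \pi_i \, b_i(O_1) \, \beta_1(i)$):

```python
def backward(A, B, O, N):
    """Backward pass: beta[t][i] = P[O_{t+2} .. O_T | Q_{t+1} = S_i, lambda] (0-indexed t)."""
    T = len(O)
    beta = [[1.0] * N]                                            # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                                # induction, t = T-1 .. 1
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

# Hypothetical toy model: N = 2 states, M = 2 symbols.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
O  = [0, 1, 0]
beta = backward(A, B, O, len(pi))
# Sanity check: sum_i pi_i * b_i(O_1) * beta_1(i) must equal P[O | lambda].
print(sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(len(pi))))
```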
Complexity:
$$ \begin{aligned} \beta_{[1, T]} \sim [\times: 2N^2 (T - 1); + : (T - 1) N(N - 1)] \end{aligned} $$Problem 2: The Training Criteria
Target: Given the observations, find the most likely state at each $ t $. (A New Goal)
$$ \begin{aligned} Q_t &= \arg\max_{i \in [1, N]} P[Q_t = S_i | O, \lambda] \\ &= \arg\max_{i \in [1, N]} \gamma_{t}(i) \end{aligned} $$Then:
$$ \begin{aligned} \gamma_{t}(i) &= P[Q_t = S_i | O, \lambda] = \frac{P[Q_t = S_i, O | \lambda]}{P[O | \lambda]} \\ &= \frac{P[Q_t = S_i, O_1 \cdots O_t O_{t+1} \cdots O_T | \lambda]}{P[O | \lambda]} \\ &= \frac{P[O_{t+1} \cdots O_T | Q_t = S_i, O_1 \cdots O_t, \lambda] \cdot P[Q_t = S_i, O_1 \cdots O_t | \lambda]}{P[O | \lambda]} \\ &= \frac{P[O_{t+1} \cdots O_T | Q_t = S_i, \lambda] \cdot P[Q_t = S_i, O_1 \cdots O_t | \lambda]}{P[O | \lambda]} \\ &= \frac{\beta_t(i) \cdot \alpha_t(i)}{P[O | \lambda]} = \frac{\alpha_t(i) \cdot \beta_t(i)}{\sum_{j=1}^{N} P[Q_t = S_j, O | \lambda]} = \frac{\alpha_t(i) \cdot \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \cdot \beta_t(j)} \\ \end{aligned} $$where:
$$ \begin{aligned} &P[O_1 \cdots O_{t+1}, Q_{t+1} = S_j | \lambda] \\ &= \sum_{i=1}^{N} P[O_1 \cdots O_t O_{t+1}, Q_t = S_i, Q_{t+1} = S_j | \lambda] \\ &= \sum_{i=1}^{N} P[O_{t+1}, Q_{t+1} = S_j | O_1 \cdots O_t, Q_t = S_i, \lambda] \cdot P[O_1 \cdots O_t, Q_t = S_i | \lambda] \\ &= \sum_{i=1}^{N} P[O_{t+1}, Q_{t+1} = S_j | Q_t = S_i, \lambda] \cdot P[O_1 \cdots O_t, Q_t = S_i | \lambda] \\ &= \sum_{i=1}^{N} P[O_{t+1} | Q_{t+1} = S_j, Q_t = S_i, \lambda] \cdot P[Q_{t+1} = S_j | Q_t = S_i, \lambda] \cdot \alpha_{t}(i) \\ &= \sum_{i=1}^{N} P[O_{t+1} | Q_{t+1} = S_j, \lambda] \cdot P[Q_{t+1} = S_j | Q_t = S_i, \lambda] \cdot \alpha_{t}(i) \\ &= \sum_{i=1}^{N} b_{j}(O_{t+1}) \cdot a_{ij} \cdot \alpha_{t}(i) \\ &= b_{j}(O_{t+1}) \cdot \sum_{i=1}^{N} a_{ij} \cdot \alpha_{t}(i) \\ &= \alpha_{t+1}(j) \end{aligned} $$and:
$$ \begin{aligned} &P[O_{t+1} \cdots O_T | Q_{t} = S_i] \\ &= \sum_{j=1}^{N} P[O_{t+1} O_{t+2} \cdots O_T, Q_{t+1} = S_j | Q_{t} = S_i] \\ &= \sum_{j=1}^{N} P[O_{t+2} \cdots O_T | O_{t+1}, Q_{t+1} = S_j, Q_{t} = S_i] \cdot P[O_{t+1}, Q_{t+1} = S_j | Q_{t} = S_i] \\ &= \sum_{j=1}^{N} P[O_{t+2} \cdots O_T | Q_{t+1} = S_j] \cdot P[O_{t+1} | Q_{t+1} = S_j, Q_{t} = S_i] \cdot P[Q_{t+1} = S_j | Q_{t} = S_i] \\ &= \sum_{j=1}^{N} \beta_{t+1}(j) \cdot P[O_{t+1} | Q_{t+1} = S_j] \cdot P[Q_{t+1} = S_j | Q_{t} = S_i] \\ &= \sum_{j=1}^{N} \beta_{t+1}(j) \cdot b_{j}(O_{t+1}) \cdot a_{ij} \\ &= \beta_{t}(i) \end{aligned} $$So the former probability is computed forward in time, and the latter backward in time.
Note that $ \gamma_t(i) $ is also called the smoothed value. This process of using both $ \alpha_t(i) $ and $ \beta_t(i) $ is called smoothing.
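The smoothing computation above can be sketched end to end (the toy model numbers are invented for illustration; `smooth` combines the forward and backward passes defined in the text):

```python
def smooth(pi, A, B, O):
    """Return gamma[t][i] = P[Q_t = S_i | O, lambda], the smoothed state posteriors."""
    N, T = len(pi), len(O)
    # Forward variables alpha_t(i).
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([B[j][O[t]] * sum(alpha[t-1][i] * A[i][j] for i in range(N))
                      for j in range(N)])
    # Backward variables beta_t(i).
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
                   for i in range(N)]
    # gamma_t(i) = alpha_t(i) * beta_t(i) / sum_j alpha_t(j) * beta_t(j).
    gamma = []
    for t in range(T):
        z = sum(alpha[t][j] * beta[t][j] for j in range(N))   # equals P[O | lambda] at every t
        gamma.append([alpha[t][i] * beta[t][i] / z for i in range(N)])
    return gamma

# Hypothetical toy model: N = 2 states, M = 2 symbols.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
gamma = smooth(pi, A, B, [0, 1, 0])
print([max(range(2), key=lambda i: g[i]) for g in gamma])  # most likely state at each t
```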
Problem 3: Determine Model Parameters
There is no known analytical way to estimate the model parameters optimally. Instead, there are iterative procedures that reach a local maximum (the EM / Baum-Welch algorithm), or gradient techniques.
Iterative Procedure:
Define:
$$ \begin{aligned} \xi_{t}(i, j) &= P[Q_{t+1} = S_j, Q_{t} = S_i | O, \lambda] \\ &= \frac{P[Q_{t+1} = S_j, Q_{t} = S_i , O_1\cdots O_{T} | \lambda]}{P[O|\lambda]} \\ &= \frac{P[O_{t+2}\cdots O_T | Q_{t+1} = S_j, \lambda] \cdot P[O_{t+1} | Q_{t+1} = S_j, \lambda ] \cdot P[Q_{t+1} = S_j| Q_{t} = S_i, \lambda] \cdot P[Q_t = S_i, O_1\cdots O_{t} | \lambda]}{P[O|\lambda]} \\ &= \frac{\beta_{t+1}(j) \cdot b_{j}(O_{t+1}) \cdot a_{ij} \cdot \alpha_{t}(i)}{P[O|\lambda]} \\ &= \frac{\beta_{t+1}(j) \cdot b_{j}(O_{t+1}) \cdot a_{ij} \cdot \alpha_{t}(i)}{\sum_{j=1}^{N} \sum_{i=1}^{N} \beta_{t+1}(j) \cdot b_{j}(O_{t+1}) \cdot a_{ij} \cdot \alpha_{t}(i)} \\ \end{aligned} $$Then:
$$ \begin{aligned} \gamma_t(i) &= P[Q_t = S_i | O, \lambda] \\ &= \sum_{j=1}^{N} P[Q_{t+1} = S_j, Q_t = S_i | O, \lambda] \\ &= \sum_{j=1}^{N} \xi_{t}(i, j) \end{aligned} $$(valid for $ t = 1, \dots, T - 1 $). Statistical Meaning:
$$ \begin{aligned} \sum_{t=1}^{T - 1} \gamma_{t}(i) &= \mathbb{E}[\text{\#Transitions from $S_i$}] \\ \sum_{t=1}^{T - 1} \xi_{t}(i, j) &= \mathbb{E}[\text{\#Transitions from $S_i$ to $S_j$}] \\ \end{aligned} $$Re-estimation Formulas:
$$ \begin{aligned} \overline{\pi_i} &= \gamma_{1}(i) \\ \overline{a}_{ij} &= \frac{\mathbb{E}[\text{\#Transitions from $S_i$ to $S_j$}]}{\mathbb{E}[\text{\#Transitions from $S_i$}]} = \frac{\sum_{t=1}^{T - 1} \xi_{t}(i, j)}{\sum_{t=1}^{T - 1} \gamma_{t}(i)} \\ \overline{b}_j(k) &= \frac{\mathbb{E}[\text{\#In $S_j$, observing $V_k$}]}{\mathbb{E}[\text{\#In $S_j$}]} = \frac{\sum_{t=1}^{T}1_{[O_t = V_k]} \cdot \gamma_{t}(j)}{\sum_{t=1}^{T} \gamma_{t}(j)} \end{aligned} $$Procedure:
- Use initial $ \lambda $ to compute a $ \overline{\lambda} $.
- If $ P[O | \overline{\lambda}] > P[O | \lambda] $, replace $ \lambda $ with $ \overline{\lambda} $ and repeat.
Stochastic Constraints:
$$ \begin{aligned} \sum_{i=1}^{N} \overline{\pi}_i &= 1 \\ \sum_{j=1}^{N} \overline{a}_{ij} &= 1 \\ \sum_{k=1}^{M} \overline{b}_j(k) &= 1 \end{aligned} $$
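One re-estimation step, combining $\xi$, $\gamma$, and the re-estimation formulas above, can be sketched as follows (a minimal illustration with an invented toy model; the resulting $\overline{\pi}$, $\overline{A}$, $\overline{B}$ satisfy the stochastic constraints by construction):

```python
def baum_welch_step(pi, A, B, O):
    """One Baum-Welch re-estimation step: return (pi_bar, A_bar, B_bar)."""
    N, M, T = len(pi), len(B[0]), len(O)
    # Forward and backward variables.
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([B[j][O[t]] * sum(alpha[t-1][i] * A[i][j] for i in range(N))
                      for j in range(N)])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
                   for i in range(N)]
    PO = sum(alpha[-1])                                       # P[O | lambda]
    # gamma_t(i) and xi_t(i, j) as defined in the text.
    gamma = [[alpha[t][i] * beta[t][i] / PO for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # Re-estimation formulas.
    pi_bar = gamma[0]
    A_bar = [[sum(xi[t][i][j] for t in range(T-1)) / sum(gamma[t][i] for t in range(T-1))
              for j in range(N)] for i in range(N)]
    B_bar = [[sum(gamma[t][j] for t in range(T) if O[t] == k) / sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return pi_bar, A_bar, B_bar

# Hypothetical toy model: N = 2 states, M = 2 symbols.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
O  = [0, 1, 0, 0, 1]
pi2, A2, B2 = baum_welch_step(pi, A, B, O)
```

Iterating this step until $P[O|\overline{\lambda}]$ stops improving gives the local-maximization procedure described above.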